1 Description

This report is subset into 2 parts:

1.1 Clinical Context

Computed Tomography scanner (CT scan) is a widely spread and popular exam in oncology: it reflects the density of the tissues of the human body. It is, then, adapted to the study of lung cancer because lungs are mostly filled with air (low density) while tumors are made of dense tissues.

1.2 Clinical context

Small Cell Lung Cancer can itself be split into four major subtypes based on histology observations: squamous cell carcinoma, large cell carcinoma, adenocarcinoma and a mixture of all

1.3 Goal

Predict the survival time of a patient (remaining days to live) from one three-dimensional CT scan (grayscale image) and a set of pre-extracted quantitative imaging features, as well as clinical data.

1.4 dataset

To each patient corresponds one CT scan, and one binary segmentation mask. The segmentation mask is a binary volume of the same size as the CT scan, except that it is composed of zeroes everywhere there is no tumour, and 1 otherwise. The CT scans and the associated segmentation masks are subsets of two public datasets:

  • NSCLC Radiomics (subset of 285 patients)
  • NSCLC RadioGenomics(subset of 141 patients)

Both training and validation contain for each patient, the time to event (days), as well as the censorship. Censorship indicates whether the event (death) was observed or whether the patient escaped the study: this can happen when the patient’s track was lost, or if the patient died of causes not related to the disease.

2 Tumor Arrays Slides Exploration

2.1 Setting python version and anaconda environment for R :-)

## python:         /Users/Mezhoud/anaconda3/bin/python3
## libpython:      /Users/Mezhoud/anaconda3/lib/libpython3.7m.dylib
## pythonhome:     /Users/Mezhoud/anaconda3:/Users/Mezhoud/anaconda3
## version:        3.7.5 (default, Oct 25 2019, 10:52:18)  [Clang 4.0.1 (tags/RELEASE_401/final)]
## numpy:          /Users/Mezhoud/anaconda3/lib/python3.7/site-packages/numpy
## numpy_version:  1.17.3
## 
## NOTE: Python version was forced by use_python function

2.2 Load scans and masks of Tumor lung cancer

## the dimension of scan array is:  (92, 92, 92)
## the dimension of mask array is:  (92, 92, 92)
## plot some images from patient 002:

2.2.1 Function to plot multiple image from array

The plot shows colored images scan of 6 slides. At this step it is not easy to distinguish the tumor.

The dataset has aslo the masks for each scan slide which locate the position of the tumor in the scan.

  • The first 3 slides do not have tumor streak, however the next 3 ones indicate the position of the tumor in red color.
  • If we plot more slides, we can observe the increase of the size of the tumor during plotting slides.
  • At the end the size the Tumor is decreasing.

  • We can note that the crop is adjusted to the size of the tumor

If we compare with the scan slides, we obtain:

  • It is always not easy to delimit the tumor in scan images

  • Comparing to masks, we can note that, between scan 34 and scan 65, the slides have more yellow stain or less blue color.

2.3 Load images from test dataset

  • Plot mask slides from test dataset

  • In the test images, we can also observe tumor slides like in train dataset.

  • For training step, it maybe better to use masks slides than scan. But we need to explore variables in clinical data and radiomics and think how to associate images with numeric variables.

  • One think we can do is the convert slides to dataframe (each slide in one row) and then we can obtain one matrix for each patient tumor.

  • At this step I will switch from python to R :-)

2.4 Import image from python environment to R

The goal of this step is to convert image matrices as vector. So, each image can be ranged in one row. Finally, we can obtain one dataframe with 92 rows (images) ofr each sample (patient).

2.4.1 Import useful R packages

2.4.1.2 Understanding the structure of the array of images

## [1] "One image is a: matrix"
## [1] "Two images are an: array"
## [1] "Print the first 10 pixels of Scan N°1: "
##       [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
##  [1,] -777 -759 -707 -697 -749 -796 -826 -837 -858  -860
##  [2,] -783 -791 -774 -768 -787 -808 -827 -826 -829  -829
##  [3,] -804 -841 -827 -812 -820 -840 -831 -801 -792  -794
##  [4,] -830 -857 -839 -816 -805 -818 -801 -764 -734  -722
##  [5,] -844 -854 -843 -823 -799 -787 -771 -743 -704  -670
##  [6,] -844 -849 -844 -831 -821 -810 -809 -791 -760  -719
##  [7,] -848 -847 -848 -844 -847 -841 -849 -841 -829  -806
##  [8,] -847 -856 -854 -846 -840 -813 -826 -849 -848  -836
##  [9,] -840 -851 -836 -823 -835 -796 -803 -853 -857  -833
## [10,] -847 -841 -829 -817 -860 -832 -799 -842 -865  -838
## [1] "Print the first 10 pixels of Mask N°1: "
##        [,1]  [,2]  [,3]  [,4]  [,5]  [,6]  [,7]  [,8]  [,9] [,10]
##  [1,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
##  [2,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
##  [3,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
##  [4,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
##  [5,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
##  [6,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
##  [7,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
##  [8,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
##  [9,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [10,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE

2.4.1.4 Convert image matrix to vector

## [1] "The output is a: list"
## [1] "With length of: 2"
## [1] "The names of two elements are: "
## [1] "patient_002_scan" "patient_002_mask"
## [1] "which are: data.frame"
## [1] "The dimension of each dataframe is: "
## [1]   92 8464
  • At this step we stop exploring scan and mask.

  • We think to use only masks for modeling

  • Potential method: keras, mxnet

3 Exploratory Data Analysis of radiomics and clinical data

##           V1                          V2                          V3
## 1:                                 shape                       shape
## 2:           original_shape_Compactness1 original_shape_Compactness2
## 3: PatientID                                                        
## 4:       202                 0.027815034                 0.274891585
## 5:       371                  0.02301549                 0.188210005
## 6:       246                 0.027348106                 0.265739895
##                                  V4                                    V5
## 1:                            shape                                 shape
## 2: original_shape_Maximum3DDiameter original_shape_SphericalDisproportion
## 3:                                                                       
## 4:                      48.55924217                           1.537964054
## 5:                      75.70336849                           1.744961158
## 6:                      70.43436661                           1.555420243
##                           V6                         V7
## 1:                     shape                      shape
## 2: original_shape_Sphericity original_shape_SurfaceArea
## 3:                                                     
## 4:               0.650210255                 5431.33321
## 5:               0.573078659                10369.56873
## 6:               0.642913068                10558.81869
##                                   V8
## 1:                             shape
## 2: original_shape_SurfaceVolumeRatio
## 3:                                  
## 4:                       0.275227763
## 5:                       0.240726824
## 6:                       0.200765988
##    PatientID               Histology Mstage Nstage SourceDataset Tstage     age
## 1:       202          Adenocarcinoma      0      0            l2      2 66.0000
## 2:       371              large cell      0      2            l1      4 64.5722
## 3:       246 squamous cell carcinoma      0      3            l1      2 66.0452
## 4:       240                     nos      0      2            l1      3 59.3566
## 5:       284 squamous cell carcinoma      0      3            l1      4 71.0554
## 6:       348 squamous cell carcinoma      0      2            l1      2 65.0212

The radiomics features can be divided into 4 groups as follows (shown in row 1): - Group 1. First order statistics - Group 2. Shape and size based features - Group 3. Textural features - Group 4. Wavelet features

Each group can be subset into several sub-groups shown in row 2 of the radiomics dataset. To make the radiomics features numeric dataset we need to remove the two first rows and convert them to colnames.

##    Groups                              Features
## V2  shape           original_shape_Compactness1
## V3  shape           original_shape_Compactness2
## V4  shape      original_shape_Maximum3DDiameter
## V5  shape original_shape_SphericalDisproportion
## V6  shape             original_shape_Sphericity
## V7  shape            original_shape_SurfaceArea

To improve the esthetic of the dataframe, we note:

3.1 Plot the distribution of Goups and features

3.1.1 Set New Colnames of radiomics

## [1] "shape_Compactness1"           "shape_Compactness2"          
## [3] "shape_Maximum3DDiameter"      "shape_SphericalDisproportion"
## [5] "shape_Sphericity"             "shape_SurfaceArea"

3.1.2 Get new radiomics style

##   PatientID shape_Compactness1 shape_Compactness2 shape_Maximum3DDiameter
## 1       202         0.02781503          0.2748916                48.55924
## 2       371         0.02301549          0.1882100                75.70337
## 3       246         0.02734811          0.2657399                70.43437
## 4       240         0.02681111          0.2554064                46.81880
## 5       284         0.02369124          0.1994242                53.79591
## 6       348         0.03098136          0.3410383                63.74951
##   shape_SphericalDisproportion shape_Sphericity shape_SurfaceArea
## 1                     1.537964        0.6502103          5431.333
## 2                     1.744961        0.5730787         10369.569
## 3                     1.555420        0.6429131         10558.819
## 4                     1.576120        0.6344693          4221.412
## 5                     1.711620        0.5842418          5295.900
## 6                     1.431305        0.6986630          8493.134
##   shape_SurfaceVolumeRatio
## 1                0.2752278
## 2                0.2407268
## 3                0.2007660
## 4                0.3238780
## 5                0.3272407
## 6                0.1976017

3.2 Glimpse correlation between features (default order)

3.2.1 Set the first principal component order of the features

3.2.2 Set the hierarchical clustering order of the features

  • We can return to these heatmap when we predict the most importante features using modeling.

3.3 Explore Clinical data

  • The most frequente cases is Adenocarcinoma followed by Aquamous Cell Carcinoma.

  • NOS: not otherwise specified

  • It seems NOS and Nsclc Nos corespond to the same category

  • Nan is not available ?

  • The density plot shows that the must frequent cases are 65, 70, 72 years old.

_ The most frequent Nstage class is also 0, followed by 2, 3, and 1.

  • The third plots shows that the most cases are in Mstage == 0. We can focus only in this class.

  • There are two sources of dataset.

3.4 Explore output_train and output_test

##    PatientID SurvivalTime Event
## 1:       202         1378     0
## 2:       371          379     1
## 3:       246          573     1
## 4:       240          959     0
## 5:       284         2119     0
## 6:       348          706     1
##    PatientID SurvivalTime Event
## 1:        13     788.4177   NaN
## 2:       155     427.6501   NaN
## 3:       404     173.5872   NaN
## 4:       407     389.8780   NaN
## 5:         9    1580.7672   NaN
## 6:        49     472.5234   NaN
  • The goal is the fill Event variable in output_test by 0 or 1.

4 Preprocessing of Train and Test dataset

The output of this section is to clean and unify variables and merge clinical, radimoics, and output_train dataset.

4.1 Train wrangling

##    PatientID Event Histology Mstage Nstage SourceDataset Tstage     age
## 1:       202     0         1      0      0             2      2 66.0000
## 2:       371     1         2      0      2             1      4 64.5722
## 3:       246     1         6      0      3             1      2 66.0452
## 4:       240     0         4      0      2             1      3 59.3566
## 5:       284     0         6      0      3             1      4 71.0554
## 6:       348     1         6      0      2             1      2 65.0212
##    SurvivalTime shape_Compactness1
## 1:         1378         0.02781503
## 2:          379         0.02301549
## 3:          573         0.02734811
## 4:          959         0.02681111
## 5:         2119         0.02369124
## 6:          706         0.03098136

4.1.1 Explore missing value in train

  • There are 18 missing age from 300.

4.2 Test wrangling

4.2.1 Transform radiomics test dataset

##   PatientID shape_Compactness1 shape_Compactness2 shape_Maximum3DDiameter
## 1        13         0.02888522         0.29645143               106.90182
## 2       155         0.03194837         0.36266005                18.81489
## 3       404         0.01599883         0.09094503               105.08092
## 4       407         0.03135766         0.34937318                46.96807
## 5         9         0.01781454         0.11275905                56.54202
## 6        49         0.03816202         0.51744596                20.12461
##   shape_SphericalDisproportion shape_Sphericity shape_SurfaceArea
## 1                     1.499738        0.6667830        29085.5414
## 2                     1.402276        0.7131265          629.4436
## 3                     2.223687        0.4497036        12509.2654
## 4                     1.419832        0.7043089         4067.6574
## 5                     2.069901        0.4831149         7093.3657
## 6                     1.245599        0.8028264          844.2344
##   shape_SurfaceVolumeRatio
## 1                0.1145278
## 2                0.7038788
## 3                0.3152977
## 4                0.2821040
## 5                0.3760316
## 6                0.5088176

4.2.3 Merge clinical, radimoics, and output_test dataset

##    PatientID Event Histology Mstage Nstage SourceDataset Tstage     age
## 1:        13   NaN         4      0      0             1      4 44.3970
## 2:       155   NaN         1      0      3             1      1 63.3183
## 3:       404   NaN         2      0      2             1      2 64.7255
## 4:       407   NaN         4      0      0             1      2 65.3635
## 5:         9   NaN         1      0      0             2      2 50.0000
## 6:        49   NaN         6      0      0             1      2 86.1410
##    SurvivalTime shape_Compactness1
## 1:     788.4177         0.02888522
## 2:     427.6501         0.03194837
## 3:     173.5872         0.01599883
## 4:     389.8780         0.03135766
## 5:    1580.7672         0.01781454
## 6:     472.5234         0.03816202

4.2.3.1 Explore missing value in test

  • There are 4 missing age from 125.

5 Xgboost modeling

5.2 Split Train dataset into Train & Valid sets

## <241/59/300>
  • We can retrieve our training and testing sets using training() and testing() functions.
##     Event  Histology     Mstage     Nstage SourceDataset     Tstage        age
##  1:     0 -1.0235305 -0.1207812 -0.8392371     1.4175490 -0.1386218 -0.2504355
##  2:     1 -0.5217999 -0.1207812  0.8392371    -0.7037831  1.7024495 -0.3972837
##  3:     1  1.4851227 -0.1207812  1.6784742    -0.7037831 -0.1386218 -0.2457867
##  4:     0  1.4851227 -0.1207812  1.6784742    -0.7037831  1.7024495  0.2695086
##  5:     1  1.4851227 -0.1207812  0.8392371    -0.7037831 -0.1386218 -0.3511044
##  6:     1 -1.0235305 -0.1207812 -0.8392371     1.4175490 -1.0591575  0.1609615
##  7:     1 -1.0235305 -0.1207812 -0.8392371    -0.7037831  1.7024495  0.6186818
##  8:     1 -1.0235305 -0.1207812  0.8392371    -0.7037831  1.7024495 -1.5788159
##  9:     0 -0.5217999 -0.1207812  0.8392371    -0.7037831 -0.1386218 -2.0372357
## 10:     1  1.4851227 -0.1207812  0.8392371     1.4175490 -0.1386218  1.1894541
##     SurvivalTime shape_Compactness1 shape_Compactness2
##  1:    0.7487397          0.3273910          0.2206572
##  2:   -0.7068626         -0.4508739         -0.5451008
##  3:   -0.4241931          0.2516768          0.1398098
##  4:    1.8284207         -0.3412982         -0.4460329
##  5:   -0.2304042          0.8408228          0.8050072
##  6:    0.7356262         -0.8618943         -0.8911634
##  7:   -0.9720474          0.3619137          0.2579747
##  8:   -0.9735045         -1.1924952         -1.1402475
##  9:    0.3815607          0.9177119          0.8979347
## 10:   -0.7345467         -0.0934834         -0.2114097

5.3 Format train and test to DMatrix

## 
## Attaching package: 'Matrix'
## The following objects are masked from 'package:tidyr':
## 
##     expand, pack, unpack
## 
## Attaching package: 'xgboost'
## The following object is masked from 'package:dplyr':
## 
##     slice

5.4 Optimize features with Cross validation

Here, we can see after how many rounds, we achieved the smallest test error.

## [1]  train-auc:0.744572+0.064073 test-auc:0.702438+0.082770 
## Multiple eval metrics are present. Will use test_auc for early stopping.
## Will train until test_auc hasn't improved in 1000 rounds.
## 
## [101]    train-auc:0.870199+0.010406 test-auc:0.719389+0.054983 
## [201]    train-auc:0.908392+0.009288 test-auc:0.738096+0.067038 
## [301]    train-auc:0.937980+0.009603 test-auc:0.742856+0.073792 
## [401]    train-auc:0.958381+0.007234 test-auc:0.745781+0.075328 
## [501]    train-auc:0.971117+0.005513 test-auc:0.748186+0.078245 
## [601]    train-auc:0.981932+0.003549 test-auc:0.748662+0.073363 
## [701]    train-auc:0.989035+0.002636 test-auc:0.751144+0.073269 
## [801]    train-auc:0.993115+0.002251 test-auc:0.752468+0.075049 
## [901]    train-auc:0.995830+0.001267 test-auc:0.752491+0.073045 
## [1001]   train-auc:0.997458+0.000986 test-auc:0.750383+0.075110 
## [1101]   train-auc:0.998436+0.000822 test-auc:0.752745+0.074776 
## [1201]   train-auc:0.999304+0.000382 test-auc:0.754863+0.075870 
## [1301]   train-auc:0.999609+0.000281 test-auc:0.757606+0.074092 
## [1401]   train-auc:0.999652+0.000233 test-auc:0.758671+0.077658 
## [1501]   train-auc:0.999847+0.000149 test-auc:0.757243+0.077075 
## [1601]   train-auc:0.999913+0.000128 test-auc:0.760429+0.080661 
## [1701]   train-auc:0.999978+0.000043 test-auc:0.761804+0.078747 
## [1801]   train-auc:1.000000+0.000000 test-auc:0.761781+0.079641 
## [1901]   train-auc:1.000000+0.000000 test-auc:0.764569+0.077992 
## [2001]   train-auc:1.000000+0.000000 test-auc:0.763607+0.077886 
## [2101]   train-auc:1.000000+0.000000 test-auc:0.764313+0.077829 
## [2201]   train-auc:1.000000+0.000000 test-auc:0.765058+0.078533 
## [2301]   train-auc:1.000000+0.000000 test-auc:0.764752+0.080319 
## [2401]   train-auc:1.000000+0.000000 test-auc:0.765500+0.082560 
## [2501]   train-auc:1.000000+0.000000 test-auc:0.765848+0.082679 
## [2601]   train-auc:1.000000+0.000000 test-auc:0.765853+0.083729 
## [2701]   train-auc:1.000000+0.000000 test-auc:0.767584+0.084333 
## [2801]   train-auc:1.000000+0.000000 test-auc:0.767564+0.083388 
## [2901]   train-auc:1.000000+0.000000 test-auc:0.768261+0.082905 
## [3001]   train-auc:1.000000+0.000000 test-auc:0.767957+0.082808 
## [3101]   train-auc:1.000000+0.000000 test-auc:0.767907+0.083715 
## [3201]   train-auc:1.000000+0.000000 test-auc:0.767216+0.083291 
## [3301]   train-auc:1.000000+0.000000 test-auc:0.767566+0.083038 
## [3401]   train-auc:1.000000+0.000000 test-auc:0.766488+0.083823 
## [3501]   train-auc:1.000000+0.000000 test-auc:0.765769+0.085249 
## [3601]   train-auc:1.000000+0.000000 test-auc:0.765775+0.084047 
## [3701]   train-auc:1.000000+0.000000 test-auc:0.766821+0.083965 
## [3801]   train-auc:1.000000+0.000000 test-auc:0.768238+0.083915 
## [3901]   train-auc:1.000000+0.000000 test-auc:0.768253+0.083865 
## [4001]   train-auc:1.000000+0.000000 test-auc:0.769293+0.084696 
## [4101]   train-auc:1.000000+0.000000 test-auc:0.767554+0.085023 
## [4201]   train-auc:1.000000+0.000000 test-auc:0.767911+0.085678 
## [4301]   train-auc:1.000000+0.000000 test-auc:0.769295+0.084663 
## [4401]   train-auc:1.000000+0.000000 test-auc:0.768966+0.083995 
## [4501]   train-auc:1.000000+0.000000 test-auc:0.769337+0.084664 
## [4601]   train-auc:1.000000+0.000000 test-auc:0.770063+0.084524 
## [4701]   train-auc:1.000000+0.000000 test-auc:0.770762+0.086182 
## [4801]   train-auc:1.000000+0.000000 test-auc:0.770762+0.086182 
## [4901]   train-auc:1.000000+0.000000 test-auc:0.770733+0.085839 
## [5001]   train-auc:1.000000+0.000000 test-auc:0.771431+0.085685 
## [5101]   train-auc:1.000000+0.000000 test-auc:0.772436+0.084950 
## [5201]   train-auc:1.000000+0.000000 test-auc:0.771077+0.084316 
## [5301]   train-auc:1.000000+0.000000 test-auc:0.771434+0.084976 
## [5401]   train-auc:1.000000+0.000000 test-auc:0.771427+0.085900 
## [5501]   train-auc:1.000000+0.000000 test-auc:0.772086+0.084872 
## [5601]   train-auc:1.000000+0.000000 test-auc:0.772449+0.084947 
## [5701]   train-auc:1.000000+0.000000 test-auc:0.772082+0.085432 
## [5801]   train-auc:1.000000+0.000000 test-auc:0.773126+0.085734 
## [5901]   train-auc:1.000000+0.000000 test-auc:0.773796+0.084919 
## [6001]   train-auc:1.000000+0.000000 test-auc:0.775868+0.085087 
## [6101]   train-auc:1.000000+0.000000 test-auc:0.775889+0.086328 
## [6201]   train-auc:1.000000+0.000000 test-auc:0.775875+0.086345 
## [6301]   train-auc:1.000000+0.000000 test-auc:0.775168+0.085948 
## [6401]   train-auc:1.000000+0.000000 test-auc:0.775168+0.085948 
## [6501]   train-auc:1.000000+0.000000 test-auc:0.775860+0.086386 
## [6601]   train-auc:1.000000+0.000000 test-auc:0.775527+0.086600 
## [6701]   train-auc:1.000000+0.000000 test-auc:0.775838+0.086254 
## [6801]   train-auc:1.000000+0.000000 test-auc:0.776898+0.085454 
## [6901]   train-auc:1.000000+0.000000 test-auc:0.777580+0.085361 
## [7001]   train-auc:1.000000+0.000000 test-auc:0.778599+0.084680 
## [7101]   train-auc:1.000000+0.000000 test-auc:0.777915+0.085147 
## [7201]   train-auc:1.000000+0.000000 test-auc:0.778264+0.084891 
## [7301]   train-auc:1.000000+0.000000 test-auc:0.778264+0.084891 
## [7401]   train-auc:1.000000+0.000000 test-auc:0.778236+0.085371 
## [7501]   train-auc:1.000000+0.000000 test-auc:0.778251+0.084939 
## [7601]   train-auc:1.000000+0.000000 test-auc:0.778565+0.083907 
## [7701]   train-auc:1.000000+0.000000 test-auc:0.778585+0.085129 
## [7801]   train-auc:1.000000+0.000000 test-auc:0.778897+0.084486 
## [7901]   train-auc:1.000000+0.000000 test-auc:0.778585+0.085129 
## [8001]   train-auc:1.000000+0.000000 test-auc:0.779602+0.084841 
## [8101]   train-auc:1.000000+0.000000 test-auc:0.778905+0.085364 
## [8201]   train-auc:1.000000+0.000000 test-auc:0.779254+0.085127 
## [8301]   train-auc:1.000000+0.000000 test-auc:0.779268+0.085049 
## [8401]   train-auc:1.000000+0.000000 test-auc:0.778200+0.084589 
## [8501]   train-auc:1.000000+0.000000 test-auc:0.777856+0.083875 
## [8601]   train-auc:1.000000+0.000000 test-auc:0.778190+0.083670 
## [8701]   train-auc:1.000000+0.000000 test-auc:0.778190+0.083670 
## [8801]   train-auc:1.000000+0.000000 test-auc:0.777508+0.084157 
## [8901]   train-auc:1.000000+0.000000 test-auc:0.777158+0.084397 
## [9001]   train-auc:1.000000+0.000000 test-auc:0.777158+0.084397 
## [9101]   train-auc:1.000000+0.000000 test-auc:0.777174+0.084371 
## [9201]   train-auc:1.000000+0.000000 test-auc:0.776489+0.084827 
## Stopping. Best iteration:
## [8221]   train-auc:1.000000+0.000000 test-auc:0.779952+0.084599
## Time difference of 38.73734 secs

5.5 Train the model

## [1]  train-auc:0.693182  eval-auc:0.586207 
## [501]    train-auc:0.957117  eval-auc:0.841379 
## [1001]   train-auc:0.993397  eval-auc:0.843678 
## [1501]   train-auc:0.999096  eval-auc:0.866667 
## [2001]   train-auc:0.999931  eval-auc:0.872414 
## [2501]   train-auc:1.000000  eval-auc:0.878161 
## [3001]   train-auc:1.000000  eval-auc:0.881609 
## [3501]   train-auc:1.000000  eval-auc:0.880460 
## [4001]   train-auc:1.000000  eval-auc:0.880460 
## [4501]   train-auc:1.000000  eval-auc:0.881609 
## [5001]   train-auc:1.000000  eval-auc:0.877011 
## [5501]   train-auc:1.000000  eval-auc:0.879310 
## [6001]   train-auc:1.000000  eval-auc:0.880460 
## [6501]   train-auc:1.000000  eval-auc:0.879310 
## [7001]   train-auc:1.000000  eval-auc:0.877011 
## [7501]   train-auc:1.000000  eval-auc:0.874713 
## [8001]   train-auc:1.000000  eval-auc:0.872414 
## [8221]   train-auc:1.000000  eval-auc:0.871264
## Time difference of 8.603443 secs

5.6 Predict valid_2 dataset

##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
## 0.0008307 0.0456546 0.5756907 0.4909866 0.8893077 0.9995762
  • We suppose that if Prob > 0.5 (Median), the Event is 1, else 0

5.7 Transform propability to binary classification

## pred_bin
##  0  1 
## 28 31

5.8 Confusion matrix for Tree model

## # A tibble: 4 x 3
##   prediction label     n
##        <dbl> <dbl> <int>
## 1          0     0    22
## 2          0     1     6
## 3          1     0     7
## 4          1     1    24

5.9 Extract the most important features from tree xgboost model

5.9.1 List the most important features

##                                 Feature         Gain        Cover    Frequency
##  1:                        SurvivalTime 0.1473016576 0.0606032617 0.0443522942
##  2:        glrlm_GrayLevelNonUniformity 0.0361461906 0.0329921725 0.0303795186
##  3:                                 age 0.0293497900 0.0344337912 0.0280356982
##  4:                    glcm_Correlation 0.0293344404 0.0316390293 0.0316415758
##  5:              glrlm_ShortRunEmphasis 0.0242824351 0.0211755591 0.0200126206
##  6:                    firstorder_Range 0.0237971560 0.0273691845 0.0287118002
##  7:             glcm_MaximumProbability 0.0233000328 0.0240000228 0.0250608492
##  8:                   glcm_ClusterShade 0.0226094231 0.0247564691 0.0221310737
##  9:  glrlm_ShortRunLowGrayLevelEmphasis 0.0218925972 0.0223398825 0.0200126206
## 10:        glrlm_RunLengthNonUniformity 0.0217701604 0.0247601084 0.0232579104
## 11:                   firstorder_Median 0.0206514365 0.0202149659 0.0173983593
## 12:                           glcm_Imc2 0.0189165892 0.0241984645 0.0246551880
## 13:             shape_Maximum3DDiameter 0.0183020510 0.0208021222 0.0216803390
## 14:       glrlm_LowGrayLevelRunEmphasis 0.0182646503 0.0207184040 0.0198323267
## 15:                   shape_SurfaceArea 0.0181659458 0.0176602938 0.0182096818
## 16:                           glcm_Imc1 0.0179234187 0.0181160355 0.0160461552
## 17:                   firstorder_Energy 0.0178565413 0.0203080998 0.0206887226
## 18:   glrlm_LongRunLowGrayLevelEmphasis 0.0170534424 0.0207731122 0.0201027675
## 19:                     firstorder_Mean 0.0164444669 0.0193970561 0.0187956369
## 20:                             glcm_Id 0.0161800000 0.0137571279 0.0128910124
## 21:              glcm_ClusterProminence 0.0160688179 0.0204573920 0.0203732083
## 22:                            glcm_Idm 0.0160089908 0.0146376942 0.0150094654
## 23:              glcm_DifferenceAverage 0.0157883970 0.0146849868 0.0157306409
## 24:              glcm_DifferenceEntropy 0.0151487822 0.0169302018 0.0162264491
## 25:                glcm_InverseVariance 0.0151286284 0.0193457245 0.0237987920
## 26:                     glcm_SumAverage 0.0149163557 0.0180440007 0.0180744614
## 27:                       glcm_Contrast 0.0147771600 0.0168717437 0.0160461552
## 28:                 glrlm_RunPercentage 0.0147585440 0.0126227537 0.0126656450
## 29:            shape_SurfaceVolumeRatio 0.0144804480 0.0133940707 0.0146488777
## 30:                              Nstage 0.0142376516 0.0173331222 0.0131163797
## 31:                           glcm_Idmn 0.0142023419 0.0121121202 0.0114486613
## 32:  glrlm_LongRunHighGrayLevelEmphasis 0.0141824721 0.0155633959 0.0180293879
## 33:                  shape_Compactness1 0.0134119067 0.0166033858 0.0192463716
## 34:                   shape_VoxelVolume 0.0130344702 0.0124567191 0.0123050572
## 35:                glcm_Autocorrelation 0.0129387816 0.0142384976 0.0139277022
## 36: glrlm_ShortRunHighGrayLevelEmphasis 0.0128974180 0.0186926076 0.0205985757
## 37:    firstorder_MeanAbsoluteDeviation 0.0125793229 0.0142261158 0.0152348328
## 38:        shape_SphericalDisproportion 0.0124878567 0.0146265292 0.0167222573
## 39:                    shape_Sphericity 0.0123218032 0.0149262125 0.0159560083
## 40:                  firstorder_Maximum 0.0123047946 0.0179208467 0.0198323267
## 41:                            glcm_Idn 0.0121079516 0.0126515402 0.0126656450
## 42:                 firstorder_Skewness 0.0120363070 0.0131467075 0.0149193185
## 43:                    glcm_JointEnergy 0.0118810705 0.0091120524 0.0094203552
## 44:          firstorder_RootMeanSquared 0.0112659617 0.0140719255 0.0155052736
## 45:                  shape_Compactness2 0.0112622210 0.0149530392 0.0173082124
## 46:                  firstorder_Entropy 0.0100879556 0.0120711812 0.0137023348
## 47:                  firstorder_Minimum 0.0098321660 0.0125515927 0.0144235103
## 48:                 firstorder_Kurtosis 0.0093894713 0.0116019425 0.0141079960
## 49:                   glcm_JointEntropy 0.0091521962 0.0096012586 0.0107725593
## 50:                 firstorder_Variance 0.0090218473 0.0101489612 0.0100063103
## 51:      glrlm_HighGrayLevelRunEmphasis 0.0089822955 0.0115605924 0.0135671144
## 52:               glrlm_LongRunEmphasis 0.0079809619 0.0083208852 0.0089245470
## 53:        firstorder_StandardDeviation 0.0079070265 0.0101078093 0.0107274858
## 54:                     glcm_SumEntropy 0.0075818420 0.0094132818 0.0102767511
## 55:               firstorder_Uniformity 0.0075086249 0.0090427880 0.0108176327
## 56:                glcm_ClusterTendency 0.0073854350 0.0089882640 0.0105922654
## 57:                           Histology 0.0028452880 0.0044211127 0.0059496980
## 58:                              Tstage 0.0018435905 0.0019519711 0.0030199225
## 59:                       SourceDataset 0.0007104185 0.0005798108 0.0004056612
##                                 Feature         Gain        Cover    Frequency
  • Survival Time is the most important feature, followed by age, and 8 texture description.

5.10 Test Xgboost Prediction

5.10.1 Load test data and format to DMatrix

## [1] 0.22829722 0.94555700 0.93745029 0.98050964 0.01068046 0.93734211
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.00421 0.17358 0.63923 0.53808 0.89808 0.99944

5.11 submission

5.11.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.1

6 Random Forest Survival model (ranger package)

This method will give us an outcome probability over a time continuum (flipping the non-event to event probability)

Let’s take a quick look at the time period range in the training portion of our data set:

6.1 Dealing with missing Age

6.1.1 Check the data for missing values.

##                           PatientID                               Event 
##                                   0                                   0 
##                           Histology                              Mstage 
##                                   0                                   0 
##                              Nstage                       SourceDataset 
##                                   0                                   0 
##                              Tstage                                 age 
##                                   0                                  16 
##                        SurvivalTime                  shape_Compactness1 
##                                   0                                   0 
##                  shape_Compactness2             shape_Maximum3DDiameter 
##                                   0                                   0 
##        shape_SphericalDisproportion                    shape_Sphericity 
##                                   0                                   0 
##                   shape_SurfaceArea            shape_SurfaceVolumeRatio 
##                                   0                                   0 
##                   shape_VoxelVolume                   firstorder_Energy 
##                                   0                                   0 
##                  firstorder_Entropy                 firstorder_Kurtosis 
##                                   0                                   0 
##                  firstorder_Maximum                     firstorder_Mean 
##                                   0                                   0 
##    firstorder_MeanAbsoluteDeviation                   firstorder_Median 
##                                   0                                   0 
##                  firstorder_Minimum                    firstorder_Range 
##                                   0                                   0 
##          firstorder_RootMeanSquared                 firstorder_Skewness 
##                                   0                                   0 
##        firstorder_StandardDeviation               firstorder_Uniformity 
##                                   0                                   0 
##                 firstorder_Variance                glcm_Autocorrelation 
##                                   0                                   0 
##              glcm_ClusterProminence                   glcm_ClusterShade 
##                                   0                                   0 
##                glcm_ClusterTendency                       glcm_Contrast 
##                                   0                                   0 
##                    glcm_Correlation              glcm_DifferenceEntropy 
##                                   0                                   0 
##              glcm_DifferenceAverage                    glcm_JointEnergy 
##                                   0                                   0 
##                   glcm_JointEntropy                             glcm_Id 
##                                   0                                   0 
##                            glcm_Idm                           glcm_Imc1 
##                                   0                                   0 
##                           glcm_Imc2                           glcm_Idmn 
##                                   0                                   0 
##                            glcm_Idn                glcm_InverseVariance 
##                                   0                                   0 
##             glcm_MaximumProbability                     glcm_SumAverage 
##                                   0                                   0 
##                     glcm_SumEntropy              glrlm_ShortRunEmphasis 
##                                   0                                   0 
##               glrlm_LongRunEmphasis        glrlm_GrayLevelNonUniformity 
##                                   0                                   0 
##        glrlm_RunLengthNonUniformity                 glrlm_RunPercentage 
##                                   0                                   0 
##       glrlm_LowGrayLevelRunEmphasis      glrlm_HighGrayLevelRunEmphasis 
##                                   0                                   0 
##  glrlm_ShortRunLowGrayLevelEmphasis glrlm_ShortRunHighGrayLevelEmphasis 
##                                   0                                   0 
##   glrlm_LongRunLowGrayLevelEmphasis  glrlm_LongRunHighGrayLevelEmphasis 
##                                   0                                   0

6.1.2 Imputation processing for train data

## Loading required package: lattice
## 
## Attaching package: 'mice'
## The following object is masked from 'package:tidyr':
## 
##     complete
## The following objects are masked from 'package:base':
## 
##     cbind, rbind
## Warning: Number of logged events: 1
  • We may not want to use a certain variable as predictors. For example, the PatitenID variable does not have any predictive value.
  • If we want to skip a variable from imputation use the code below. This variable will be used for prediction.
  • Now let specify the methods for imputing the missing values. There are specific methods for continues, binary and ordinal variables. I set different methods for each variable. You can add more than one variable in each method.
  • Now it is time to run the multiple (m=5) imputation.
## 
##  iter imp variable
##   1   1  age
##   1   2  age
##   1   3  age
##   1   4  age
##   1   5  age
##   2   1  age
##   2   2  age
##   2   3  age
##   2   4  age
##   2   5  age
##   3   1  age
##   3   2  age
##   3   3  age
##   3   4  age
##   3   5  age
##   4   1  age
##   4   2  age
##   4   3  age
##   4   4  age
##   4   5  age
##   5   1  age
##   5   2  age
##   5   3  age
##   5   4  age
##   5   5  age
## Warning: Number of logged events: 25
  • Create a dataset after imputation.
  • Check for missings in the imputed dataset.
##                           PatientID                               Event 
##                                   0                                   0 
##                           Histology                              Mstage 
##                                   0                                   0 
##                              Nstage                       SourceDataset 
##                                   0                                   0 
##                              Tstage                                 age 
##                                   0                                   0 
##                        SurvivalTime                  shape_Compactness1 
##                                   0                                   0 
##                  shape_Compactness2             shape_Maximum3DDiameter 
##                                   0                                   0 
##        shape_SphericalDisproportion                    shape_Sphericity 
##                                   0                                   0 
##                   shape_SurfaceArea            shape_SurfaceVolumeRatio 
##                                   0                                   0 
##                   shape_VoxelVolume                   firstorder_Energy 
##                                   0                                   0 
##                  firstorder_Entropy                 firstorder_Kurtosis 
##                                   0                                   0 
##                  firstorder_Maximum                     firstorder_Mean 
##                                   0                                   0 
##    firstorder_MeanAbsoluteDeviation                   firstorder_Median 
##                                   0                                   0 
##                  firstorder_Minimum                    firstorder_Range 
##                                   0                                   0 
##          firstorder_RootMeanSquared                 firstorder_Skewness 
##                                   0                                   0 
##        firstorder_StandardDeviation               firstorder_Uniformity 
##                                   0                                   0 
##                 firstorder_Variance                glcm_Autocorrelation 
##                                   0                                   0 
##              glcm_ClusterProminence                   glcm_ClusterShade 
##                                   0                                   0 
##                glcm_ClusterTendency                       glcm_Contrast 
##                                   0                                   0 
##                    glcm_Correlation              glcm_DifferenceEntropy 
##                                   0                                   0 
##              glcm_DifferenceAverage                    glcm_JointEnergy 
##                                   0                                   0 
##                   glcm_JointEntropy                             glcm_Id 
##                                   0                                   0 
##                            glcm_Idm                           glcm_Imc1 
##                                   0                                   0 
##                           glcm_Imc2                           glcm_Idmn 
##                                   0                                   0 
##                            glcm_Idn                glcm_InverseVariance 
##                                   0                                   0 
##             glcm_MaximumProbability                     glcm_SumAverage 
##                                   0                                   0 
##                     glcm_SumEntropy              glrlm_ShortRunEmphasis 
##                                   0                                   0 
##               glrlm_LongRunEmphasis        glrlm_GrayLevelNonUniformity 
##                                   0                                   0 
##        glrlm_RunLengthNonUniformity                 glrlm_RunPercentage 
##                                   0                                   0 
##       glrlm_LowGrayLevelRunEmphasis      glrlm_HighGrayLevelRunEmphasis 
##                                   0                                   0 
##  glrlm_ShortRunLowGrayLevelEmphasis glrlm_ShortRunHighGrayLevelEmphasis 
##                                   0                                   0 
##   glrlm_LongRunLowGrayLevelEmphasis  glrlm_LongRunHighGrayLevelEmphasis 
##                                   0                                   0
  • Accuracy
## [1] "train$age"
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   42.51   62.98   69.95   68.77   76.20   87.13      16
## [1] "imputed$age"
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   42.51   63.00   70.00   68.88   76.35   87.13

6.1.3 Do the same imputation for test

## Warning: Number of logged events: 2
## 
##  iter imp variable
##   1   1  age
##   1   2  age
##   1   3  age
##   1   4  age
##   1   5  age
##   2   1  age
##   2   2  age
##   2   3  age
##   2   4  age
##   2   5  age
##   3   1  age
##   3   2  age
##   3   3  age
##   3   4  age
##   3   5  age
##   4   1  age
##   4   2  age
##   4   3  age
##   4   4  age
##   4   5  age
##   5   1  age
##   5   2  age
##   5   3  age
##   5   4  age
##   5   5  age
## Warning: Number of logged events: 25
##    PatientID Event Histology Mstage Nstage SourceDataset Tstage     age
## 1         13    NA         4      0      0             1      4 44.3970
## 2        155    NA         1      0      3             1      1 63.3183
## 3        404    NA         2      0      2             1      2 64.7255
## 4        407    NA         4      0      0             1      2 65.3635
## 5          9    NA         1      0      0             2      2 50.0000
## 6         49    NA         6      0      0             1      2 86.1410
## 7         55    NA         3      0      0             1      1 75.2663
## 8        200    NA         3      0      0             1      1 85.4511
## 9        170    NA         2      0      3             1      1 69.8727
## 10       387    NA         2      0      3             1      2 52.8569
##    SurvivalTime shape_Compactness1
## 1      788.4177         0.02888522
## 2      427.6501         0.03194837
## 3      173.5872         0.01599883
## 4      389.8780         0.03135766
## 5     1580.7672         0.01781454
## 6      472.5234         0.03816202
## 7     1970.9725         0.03699879
## 8      530.4248         0.03373779
## 9     1067.4630         0.01929334
## 10     378.3248         0.02531470

6.2 Preprocessing Train and Test

In order to measure the AUC of each model we need to split the data randomly (with seed) into two equal parts:

## [1] 151  62
## [1] 149  62

In order to align the survival and the classification models, we will focus on the probability of reaching event over the first quantile(test$SurvivalTime)[[2]] days.

  • We also need to create a classification-centric outcome variable. This will measure how many patients reached event or not within the chosen period. Here we look for a censor feature of 1 (i.e. the event happened) under the chosen period to set the outcome to 1, everything else is set to 0:
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   0.000   0.000   0.457   1.000   1.000
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0000  0.0000  0.4228  1.0000  1.0000
  • Now we can easily get an AUC score on the probability of reaching event within our allotted period choice

6.3 Survival Model (ranger)

## Surv(SurvivalTime, Event) ~ Histology + Mstage + Nstage + SourceDataset + 
##     Tstage + age + shape_Compactness1 + shape_Compactness2 + 
##     shape_Maximum3DDiameter + shape_SphericalDisproportion + 
##     shape_Sphericity + shape_SurfaceArea + shape_SurfaceVolumeRatio + 
##     shape_VoxelVolume + firstorder_Energy + firstorder_Entropy + 
##     firstorder_Kurtosis + firstorder_Maximum + firstorder_Mean + 
##     firstorder_MeanAbsoluteDeviation + firstorder_Median + firstorder_Minimum + 
##     firstorder_Range + firstorder_RootMeanSquared + firstorder_Skewness + 
##     firstorder_StandardDeviation + firstorder_Uniformity + firstorder_Variance + 
##     glcm_Autocorrelation + glcm_ClusterProminence + glcm_ClusterShade + 
##     glcm_ClusterTendency + glcm_Contrast + glcm_Correlation + 
##     glcm_DifferenceEntropy + glcm_DifferenceAverage + glcm_JointEnergy + 
##     glcm_JointEntropy + glcm_Id + glcm_Idm + glcm_Imc1 + glcm_Imc2 + 
##     glcm_Idmn + glcm_Idn + glcm_InverseVariance + glcm_MaximumProbability + 
##     glcm_SumAverage + glcm_SumEntropy + glrlm_ShortRunEmphasis + 
##     glrlm_LongRunEmphasis + glrlm_GrayLevelNonUniformity + glrlm_RunLengthNonUniformity + 
##     glrlm_RunPercentage + glrlm_LowGrayLevelRunEmphasis + glrlm_HighGrayLevelRunEmphasis + 
##     glrlm_ShortRunLowGrayLevelEmphasis + glrlm_ShortRunHighGrayLevelEmphasis + 
##     glrlm_LongRunLowGrayLevelEmphasis + glrlm_LongRunHighGrayLevelEmphasis
##                           glcm_Idm                           glcm_Idn 
##                        0.010874783                        0.008679476 
##            glcm_MaximumProbability             glrlm_ShortRunEmphasis 
##                        0.006445334                        0.005767017 
##                glrlm_RunPercentage                          glcm_Imc1 
##                        0.005115244                        0.004855692 
##                    firstorder_Mean                  glcm_JointEntropy 
##                        0.004413386                        0.004139470 
##                    glcm_SumEntropy       glrlm_RunLengthNonUniformity 
##                        0.003978832                        0.003897940 
##                             Tstage                          glcm_Idmn 
##                        0.003583321                        0.003427321 
##                  firstorder_Median                                age 
##                        0.003387350                        0.003266305 
##                firstorder_Variance                 firstorder_Entropy 
##                        0.003228114                        0.003219570 
## glrlm_ShortRunLowGrayLevelEmphasis                      SourceDataset 
##                        0.003183818                        0.003181861 
##   firstorder_MeanAbsoluteDeviation                    glcm_SumAverage 
##                        0.003039612                        0.002837739
  • The orange PatientID has more probability to survive.
## [1] "Orange PatientID: "
##   Event Tstage age SurvivalTime
## 1     0      2  66         1378
## [1] "Blue PatientID: "
##    Event Tstage     age SurvivalTime
## 56     1      4 62.4476          303
  • Once we have our survival_model model object, we can take a look at some probabilities of survival (this is just for illustrative purposes as we haven’t split our data set yet). Let’s look at two patients - row 1 and row 56:

6.3.1 Scoring the Random Forest Survival Model

First we get the basic survival prediction using our validation split set and then we flip the probability of the period of choice and get the AUC score:

## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
## 
## Call:
## roc.default(response = validate_df_classification$ReachedEvent,     predictor = 1 - suvival_predictions$survival[, which(suvival_predictions$unique.death.times ==         period_choice)])
## 
## Data: 1 - suvival_predictions$survival[, which(suvival_predictions$unique.death.times == period_choice)] in 86 controls (validate_df_classification$ReachedEvent 0) < 63 cases (validate_df_classification$ReachedEvent 1).
## Area under the curve: 0.7171

7 GBM Classification modeling

## ReachedEvent ~ Histology + Mstage + Nstage + SourceDataset + 
##     Tstage + age + shape_Compactness1 + shape_Compactness2 + 
##     shape_Maximum3DDiameter + shape_SphericalDisproportion + 
##     shape_Sphericity + shape_SurfaceArea + shape_SurfaceVolumeRatio + 
##     shape_VoxelVolume + firstorder_Energy + firstorder_Entropy + 
##     firstorder_Kurtosis + firstorder_Maximum + firstorder_Mean + 
##     firstorder_MeanAbsoluteDeviation + firstorder_Median + firstorder_Minimum + 
##     firstorder_Range + firstorder_RootMeanSquared + firstorder_Skewness + 
##     firstorder_StandardDeviation + firstorder_Uniformity + firstorder_Variance + 
##     glcm_Autocorrelation + glcm_ClusterProminence + glcm_ClusterShade + 
##     glcm_ClusterTendency + glcm_Contrast + glcm_Correlation + 
##     glcm_DifferenceEntropy + glcm_DifferenceAverage + glcm_JointEnergy + 
##     glcm_JointEntropy + glcm_Id + glcm_Idm + glcm_Imc1 + glcm_Imc2 + 
##     glcm_Idmn + glcm_Idn + glcm_InverseVariance + glcm_MaximumProbability + 
##     glcm_SumAverage + glcm_SumEntropy + glrlm_ShortRunEmphasis + 
##     glrlm_LongRunEmphasis + glrlm_GrayLevelNonUniformity + glrlm_RunLengthNonUniformity + 
##     glrlm_RunPercentage + glrlm_LowGrayLevelRunEmphasis + glrlm_HighGrayLevelRunEmphasis + 
##     glrlm_ShortRunLowGrayLevelEmphasis + glrlm_ShortRunHighGrayLevelEmphasis + 
##     glrlm_LongRunLowGrayLevelEmphasis + glrlm_LongRunHighGrayLevelEmphasis

## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
## 
## Call:
## roc.default(response = validate_df_classification$ReachedEvent,     predictor = validate_predictions)
## 
## Data: validate_predictions in 86 controls (validate_df_classification$ReachedEvent 0) < 63 cases (validate_df_classification$ReachedEvent 1).
## Area under the curve: 0.7213

Now that both models can predict the same period and the probability of reaching the event, we average them together and see how they help each other (straight 50/50 here which may not be the best mix)

7.1 Blend both models (Survival and Classification) together

## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
## 
## Call:
## roc.default(response = validate_df_classification$ReachedEvent,     predictor = (validate_predictions + (1 - suvival_predictions$survival[,         which(suvival_predictions$unique.death.times == period_choice)]))/2)
## 
## Data: (validate_predictions + (1 - suvival_predictions$survival[, which(suvival_predictions$unique.death.times == period_choice)]))/2 in 86 controls (validate_df_classification$ReachedEvent 0) < 63 cases (validate_df_classification$ReachedEvent 1).
## Area under the curve: 0.7239